Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias
As one popular modeling approach to end-to-end speech recognition,
attention-based encoder-decoder models are known to suffer from length bias and
the corresponding beam problem. Various heuristics have been applied within
simple beam search to ease the problem, most requiring considerable tuning. We
show that such heuristics are not a proper modeling refinement and result in
severe performance degradation when the beam size is greatly increased. We
propose a novel beam search derived from
reinterpreting the sequence posterior with an explicit length modeling. By
applying the reinterpreted probability together with beam pruning, the obtained
final probability leads to a robust model modification, which allows reliable
comparison among output sequences of different lengths. Experimental
verification on the LibriSpeech corpus shows that the proposed approach solves
the length bias problem without heuristics or additional tuning effort. It
provides robust decision making and consistently good performance under both
small and very large beam sizes. Compared with the best results of the
heuristic baseline, the proposed approach achieves the same WER on the 'clean'
sets and a 4% relative improvement on the 'other' sets. We also show that it is
more efficient with the additionally derived early-stopping criterion.
Comment: accepted at INTERSPEECH202
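The length-bias problem and its correction by an explicit length model can be illustrated with a toy scoring sketch. The probabilities and the `length_log_prob` term below are hypothetical illustrations, not the paper's actual reinterpreted posterior:

```python
import math

def beam_score(log_probs, length_log_prob=None):
    """Score a hypothesis as the sum of its token log-probabilities,
    optionally adding an explicit length-model term (hypothetical)."""
    score = sum(log_probs)
    if length_log_prob is not None:
        score += length_log_prob
    return score

# The pure sequence posterior favors the shorter hypothesis (length bias):
short = [-0.5, -0.5]                  # 2 tokens, total -1.0
long_seq = [-0.4, -0.4, -0.4, -0.4]   # 4 tokens, total -1.6
assert beam_score(short) > beam_score(long_seq)

# With an explicit length model assigning higher probability to the
# plausible longer length, the comparison across lengths is corrected:
len_model = {2: math.log(0.1), 4: math.log(0.9)}
assert beam_score(long_seq, len_model[4]) > beam_score(short, len_model[2])
```

The point of the sketch is only that an explicit length term makes hypotheses of different lengths directly comparable, which is what the proposed search exploits.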
Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that well
configured Transformer models outperform our baseline models based on the
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. Positional encoding is an essential
augmentation for the self-attention mechanism, which is otherwise invariant to
sequence ordering. However, in an autoregressive setup, as is the case in
language modeling, the amount of information increases along the position
dimension, which is itself a positional signal. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models.
Comment: To appear in the proceedings of INTERSPEECH 201
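The observation that the causal (autoregressive) mask itself carries positional information can be sketched with a minimal single-head attention over toy 1-D embeddings; this is an illustrative sketch, not the paper's Transformer:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_self_attention(x):
    """Single-head dot-product self-attention with a causal mask and NO
    positional encoding (toy scalar embeddings, a sketch only)."""
    out = []
    for t in range(len(x)):
        scores = [x[t] * x[j] for j in range(t + 1)]  # attend to positions <= t
        weights = softmax(scores)
        out.append(sum(w * x[j] for j, w in enumerate(weights)))
    return out

# The same token value yields different outputs at different positions,
# because the causal prefix grows with t -- a positional signal by itself.
x = [1.0, 2.0, 1.0]
y = causal_self_attention(x)
assert y[0] != y[2]  # token 1.0 at positions 0 and 2 sees different prefixes
```

Even without any positional encoding, position 2 attends over a larger prefix than position 0, so the model's representation is position-dependent, matching the paper's attention-weight analysis.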
Context-Dependent Acoustic Modeling without Explicit Phone Clustering
Phoneme-based acoustic modeling for large-vocabulary automatic speech
recognition takes advantage of phoneme context. The large number of
context-dependent (CD) phonemes and their highly varying statistics require
tying or smoothing to enable robust training. Usually, Classification and
Regression Trees are used for phonetic clustering, which is standard in Hidden
Markov Model (HMM)-based systems. However, this solution introduces a secondary
training objective and does not allow for end-to-end training. In this work, we
address direct phonetic context modeling for the hybrid Deep Neural Network
(DNN)/HMM approach, which does not rely on any phone-clustering algorithm to
determine the HMM state inventory. By performing different
decompositions of the joint probability of the center phoneme state and its
left and right contexts, we obtain a factorized network consisting of different
components, trained jointly. Moreover, the representation of the phonetic
context for the network relies on phoneme embeddings. On the Switchboard task,
the recognition accuracy of our proposed models is comparable to, and slightly
better than, that of the hybrid model using standard state-tying decision trees.
Comment: Submitted to Interspeech 202
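One possible chain-rule decomposition of the joint probability of the center phoneme state with its contexts can be sketched with toy factor tables; the tables below are hypothetical stand-ins for the jointly trained network components, which in the paper operate on phoneme embeddings:

```python
import math

# One decomposition of the joint probability of center state c with
# left/right phoneme contexts l, r:
#     p(l, c, r) = p(c | l, r) * p(r | l) * p(l)
# Hypothetical toy tables; each factor would be a trained network component.
p_left = {"a": 0.6, "b": 0.4}
p_right_given_left = {"a": {"x": 0.7, "y": 0.3},
                      "b": {"x": 0.5, "y": 0.5}}
p_center_given_ctx = {("a", "x"): {"s1": 0.5, "s2": 0.5},
                      ("a", "y"): {"s1": 0.9, "s2": 0.1},
                      ("b", "x"): {"s1": 0.2, "s2": 0.8},
                      ("b", "y"): {"s1": 0.6, "s2": 0.4}}

def joint_log_prob(l, c, r):
    """Sum the log-probabilities of the factorized components."""
    return (math.log(p_center_given_ctx[(l, r)][c])
            + math.log(p_right_given_left[l][r])
            + math.log(p_left[l]))

# p(l=a, c=s1, r=x) = 0.5 * 0.7 * 0.6 = 0.21
assert abs(math.exp(joint_log_prob("a", "s1", "x")) - 0.21) < 1e-12
```

Because the factors multiply back to the joint probability, the components can be trained jointly under a single objective, avoiding the secondary objective introduced by phonetic decision trees.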
Improved training of end-to-end attention models for speech recognition
Sequence-to-sequence attention-based models on subword units allow simple
open-vocabulary end-to-end speech recognition. In this work, we show that such
models can achieve competitive results on the Switchboard 300h and LibriSpeech
1000h tasks. In particular, we report the state-of-the-art word error rates
(WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets
of LibriSpeech. We introduce a new pretraining scheme by starting with a high
time reduction factor and lowering it during training, which is crucial both
for convergence and final performance. In some experiments, we also use an
auxiliary CTC loss function to aid convergence. In addition, we train long
short-term memory (LSTM) language models on subword units. By shallow fusion,
we report up to 27% relative improvements in WER over the attention baseline
without a language model.
Comment: submitted to Interspeech 201
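Shallow fusion, as used above, amounts to a per-step log-linear combination of the attention decoder's distribution and the LM's distribution over subword units. The weight `lam` and the toy probabilities below are hypothetical tuning values, not the paper's:

```python
import math

def shallow_fusion_step(am_log_probs, lm_log_probs, lam=0.3):
    """Combine attention-model and LM scores per subword unit:
    score(y) = log p_am(y) + lam * log p_lm(y); return the best unit."""
    scores = {y: am_log_probs[y] + lam * lm_log_probs[y] for y in am_log_probs}
    return max(scores, key=scores.get)

# Toy distributions over two subword candidates (hypothetical numbers):
am = {"unit_a": math.log(0.55), "unit_b": math.log(0.45)}
lm = {"unit_a": math.log(0.10), "unit_b": math.log(0.90)}

# The LM term flips the decision toward the linguistically likelier unit:
assert shallow_fusion_step(am, lm) == "unit_b"
```

In practice the combined score is applied inside beam search at every decoding step rather than to a single greedy choice as in this sketch.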
Material and Spiritual Culture of the Armenians of Karabakh: Problems of the Development and Preservation of the National Cultural Heritage in the 1920s-1990s
The article examines questions of the material and spiritual culture of the Armenians of Karabakh, as well as the problems of the development and preservation of their national cultural heritage in the 1920s-1990s.
On a Method for Checking the Operability of a Shift Register
The paper describes the functional diagram of a device for checking the operability of a shift register, based on the method of measuring the time it takes a one to shift through the register. The device uses two two-input coincidence circuits, a delay line - …
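The timing-based check can be illustrated with a minimal software simulation; the hardware details (the two coincidence circuits and the delay line) are abstracted away here, and the `faulty_stage` parameter is a hypothetical fault model, so this is only a sketch of the underlying idea:

```python
def shift_register_operable(n_stages, faulty_stage=None):
    """Inject a single 1 into an n-stage shift register and count the
    clock cycles until it reaches the output; a healthy register takes
    exactly n_stages shifts. faulty_stage models a stuck-at-0 cell."""
    reg = [0] * n_stages
    bit_in = 1
    for cycle in range(1, 2 * n_stages + 1):
        reg = [bit_in] + reg[:-1]        # one shift per clock pulse
        if faulty_stage is not None:
            reg[faulty_stage] = 0        # the stuck cell drops the bit
        bit_in = 0
        if reg[-1] == 1:
            return cycle == n_stages     # transit time matches expectation
    return False                         # the 1 never arrived: faulty

assert shift_register_operable(8) is True
assert shift_register_operable(8, faulty_stage=3) is False
```

The hardware device realizes the same comparison in analog form: the delay line encodes the expected transit time, and the coincidence circuits check that the output pulse arrives in the expected clock window.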